SUMMARY


Introduction 


Many, if not most, probability assessments come in the form of statements
like, "I am XX% certain that the answer to this question is Y." A series
of five experiments exploring the validity of such probability judgments is
described in this report. Such judgments appear to have a moderate but
systematic bias which is surprisingly insensitive to some factors (like
the intelligence or expertise of the assessor) and surprisingly sensitive
to others (the difficulty of the question). The implications of these
results for decision making are discussed.

Background  and  Approach 

Most important decisions involve uncertainty. That uncertainty is typically
quantified in subjective probability assessments of the form "I am XX%
certain that proposition Y is true." Proposition Y might be: "There will
be no major outbreaks on Cyprus before the end of the year" or "Appropriation
Z will be approved as requested." Although it is generally impossible to
assess the validity of any one such probability assessment, a set of such
estimates can be evaluated according to their "degree of calibration."
They will be perfectly calibrated if XX% of the propositions assigned
probability .XX turn out to be true (e.g., 50% of those given a .50 chance
of being true). Any systematic bias in such probability estimates could
lead to inaccuracies in decisions relying on them.

Five  experiments,  involving  over  five  hundred  people,  studied  the  calibration 
of  subjective  probability  assessments  assigned  to  propositions  regarding 
a wide  variety  of  general  knowledge  questions.  For  each  question,  people 
chose  one  of  two  possible  answers  as  the  correct  one,  and  then  gave  the 
probability  that  their  answer  was  correct. 

Findings 

1.  People's  probability  estimates  show  moderate  validity  for  all  but  the 
most  difficult  tasks. 

2.  The most common sources of invalidity are: (a) overconfidence: people
believe that they know more than they actually do; (b) insensitivity:
people believe that they can discern finer distinctions in their own
subjective uncertainty than they actually can.

3.  People  are  no  better  calibrated  when  dealing  with  questions  in  their 
own  area  of  expertise  than  when  dealing  with  general  knowledge  questions. 

4.  Intelligence of the assessor has no effect on calibration.

5.  Calibration changes markedly with questions of different difficulty.
Although people are typically overconfident, that overconfidence increases
as questions get more difficult and changes to underconfidence with the
easiest questions.

Several simulations were conducted to make certain that these conclusions
were not artifactual.

Implications 

Sophisticated  decision  analyses  typically  include  sensitivity  analyses 
which  show  how  sensitive  their  conclusions  are  to  errors  in  the  probability 
and  utility  estimates  on  which  they  are  based.  Results  of  the  present 
experiments  show  what  range  of  errors  should  be  included  in  sensitivity 
analyses.  Further  work  is  needed  to  see  if  different  kinds  of  questions 
and  different  ways  of  asking  for  probabilities  produce  similar  errors — 
and  if  there  is  one  best  way  to  elicit  probabilities. 

Any systematic bias in probability assessment suggests the following
intriguing possibility: instead of using the biased probabilities that
people give us, why not use corrected estimates that take known biases
into consideration? If, for example (as shown in "The Certainty Illusion"
by Slovic, Fischhoff, and Lichtenstein), people should be saying .90 when
they say .99, why not treat any estimate of .99 as though it were actually
.90? The exact correction would presumably vary from situation to
situation. Findings (3) and (4) make this situational adjustment easier
by showing that intelligence and expertise are two factors that need not
be considered. Finding (5) poses a real problem for this approach: if
the error in calibration depends on the difficulty of the questions, then
efficient correction requires knowledge of question difficulty. To know
that, we must know the right answers to the questions. If we know that
(e.g., if we know that there will be an outbreak on Cyprus before the end
of the year), then we have no need for probabilities. The report suggests
one way of capitalizing on changes in the probabilities people use to
provide an indicator of how difficult the questions are, and what sort of
correction factor should be used. However, it concludes that the best way
to resolve this problem is to improve probability estimation by training,
and thus do away with the correction problem.


TABLE  OF  CONTENTS 


SUMMARY 

LIST  OF  FIGURES 

LIST  OF  TABLES 

ACKNOWLEDGMENT 

INTRODUCTION 

ALL  EXPERIMENTS 

NO  KNOWLEDGE 

A LITTLE  KNOWLEDGE 

DIFFERENT  LEVELS  OF  KNOWLEDGE 

EFFECTS  OF  CHANCE  FLUCTUATIONS 

TESTS  VARYING  IN  DIFFICULTY  VERSUS  SUB-TESTS  VARYING 
IN  DIFFICULTY 

EXPERTISE 

INTELLIGENCE 

DISTRIBUTION  OF  RESPONSES 
DISCUSSION 
REFERENCES 
DISTRIBUTION  LIST 


DD  1473 


LIST  OF  FIGURES 


1.  Exemplar calibration curves

2.  Experiment 1: Calibration with no knowledge

3.  Experiment 2: "Mensa mea bona est." Effects of training
    on calibration

4.  Experiment 3: Overall calibration curve. General
    knowledge items, regular subjects

5.  Experiment 3 again: Best versus worst subjects

6.  Experiment 3 yet again. Calibration split six ways,
    by subjects and by items

7.  Experiment 4: Replication of the results of Figure 6,
    this time with graduate students in psychology

8.  Results of a simulation to parallel the findings of the
    previous figure

9.  Complete test versus subset of a test

10. Effects of expertise

11. Effects of brains

12. Distributions of subjects' responses


LIST  OF  TABLES 


Table 


1.  Summary  Table  of  Calibration  statistics  for 
Experiments  3 and  4 


ACKNOWLEDGMENT 


Support for this research, performed by Oregon Research Institute,
was provided by the Advanced Research Projects Agency of the Department
of Defense and was monitored under Contract N00014-76-C-0074 with the
Office of Naval Research, under subcontract from Decisions and Designs,
Inc.


THE EFFECT OF KNOWLEDGE ON THE CALIBRATION
OF PROBABILITY ASSESSMENTS

INTRODUCTION

Dealing with uncertainty is a central challenge in our day-to-day
lives. In order to manage our affairs effectively, we must make
predictions about the future behavior of individuals, groups, social systems,
economies and international engagements. Reflecting this situation,
subjective probabilities, the numerical expression of our predictions,
have found their way into psychological theories of such diverse
phenomena as motivation (Feather, 1959; Weiner, 1974), attitudes (Fishbein, 1967),
personality attributions (Jones & Davis, 1965), decision making (Edwards
& Tversky, 1967), choice behavior (Krantz, Luce, Suppes & Tversky, 1974),
and gambling (Cohen, 1960). Subjective probabilities are also an integral
part of sophisticated techniques like cost-benefit analysis and decision
analysis that are used heavily in both business and social contexts (e.g.,
Atomic Energy Commission, 1975; Raiffa, 1968; Slovic, Kunreuther, & White,
1974).

The  quality  of  people's  probability  assessments  sets  an  upper  limit 
on  the  quality  of  their  functioning  in  uncertain  environments.  Knowing 
how  good  people  are  at  assessing  probabilities  clearly  has  both  theoretical 
and  applied  importance. 

One approach to validating probability assessments is to restrict one's
attention to situations in which a "correct" probability can be consensually
defined, for example, situations where extensive frequentistic data are
available and probabilities are essentially estimates of relative
frequencies. Peterson and Beach (1967) reviewed a number of studies that adopted
this approach and found that people can estimate relative frequencies
quite well. More recently, however, Tversky and Kahneman (1973) have
suggested systematic biases that may be present in such judgments.

For  many  tasks,  however,  a consensually  defined  "correct"  answer  is 
unavailable.  This  is  particularly  true  for  probabilities  reflecting 
judges'  degrees  of  belief  in  propositions  concerning  "unique"  events  (e.g., 
what  is  the  probability  that  Portugal  will  withdraw  from  NATO  within  six 
months?)  or  the  judge's  knowledge  about  specific  items  of  information 
(e.g.,  what  is  the  probability  that  absinthe  is  a precious  stone?).  Such 
judgments  reflect  a degree  of  confidence  entirely  internal  to  the  judge. 
Even  if  we  know  that  Portugal  did  not  withdraw  from  NATO  during  the  period 
specified,  or  that  absinthe  is  a liqueur,  we  can  say  nothing  about  how 
adequately  the  judge  assessed  and  reported  his  or  her  own  uncertainty. 

There  is  no  way  to  evaluate  an  isolated  judgment  of  this  type. 

Often,  however,  the  judge  makes  many  such  responses,  assessing  the 
probability  of  many  different  unique  events  occurring  or  propositions  being 
true.  Over  such  a set  of  judgments,  validity  can  be  sought.  One  method 
of  evaluating  the  quality  of  a set  of  probability  judgments  is  to  look  at 
the  internal  consistency  or  coherence  of  the  set.  To  be  valid,  subjective 
probability  judgments  must  follow  the  axioms  of  the  probability  calculus. 
For example, since the two propositions given above are independent, the
probability of both being true ("Portugal will withdraw from NATO" and
"absinthe is a precious stone") should be equal to the product of the
probabilities of each being true. Wyer (1974) adopted this approach in a large
number of studies and found a good deal of evidence of inconsistency,
perhaps the most interesting aspect of which was a tendency to overestimate
the likelihood of compound events. Internal consistency is a necessary
condition for the validity of individual probability estimates, but it is
not sufficient. Large systematic biases may exist in entirely consistent
judgments.
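
The product rule for independent propositions can be illustrated with a short check. The following Python sketch is only an illustration: the function name, the tolerance, and the probability values are hypothetical and are not taken from any of the studies cited.

```python
def conjunction_is_coherent(p_a, p_b, p_both, tol=1e-9):
    """Return True if p_both equals p_a * p_b within tol.

    The check applies only when the two propositions are independent.
    """
    return abs(p_both - p_a * p_b) <= tol


# Hypothetical judge who assigns P(A) = .20 and P(B) = .10.
print(conjunction_is_coherent(0.20, 0.10, 0.02))  # consistent with the product rule
print(conjunction_is_coherent(0.20, 0.10, 0.05))  # overestimates the compound event
```

A judge could pass this check on every pair of propositions and still be badly calibrated, which is the point made at the end of the paragraph above.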

A more direct method for evaluating the validity of a judge's
assessments is to look at what we will call his or her degree of calibration.
Assume that the true outcome of every proposition in the set is eventually
known (by waiting six months to see what happens to Portugal, or by looking
up absinthe in a dictionary). Then a judge is perfectly calibrated if,
over the long run, for all propositions assigned the same probability,
the proportion that are true is equal to the probability assigned. Thus,
across that subset of answers to which the perfectly calibrated assessor
assigns a probability of being correct of .7, 70% should be correct, and
for all propositions to which .8 is assigned, 80% should be correct. For
an assessor producing a large number of responses, one may group like
responses and observe the hit rate for each subgroup. A graph showing
the hit rate (percent correct) for each probability response is called
a "calibration curve." Calibration curve A in Figure 1 reflects
underconfidence: whenever such a person says .7, 88% of the answers are correct;
such people know more than their responses indicate. Curve B, the
diagonal, represents perfect calibration. Curve C represents
overconfidence; for example, only 47% of all the events to which the judge responds
.7 are indeed correct.
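
The definition can be stated a little more formally. The notation below is introduced here only for illustration and is not used elsewhere in this report: let $r_i$ be the probability assigned to proposition $i$, let $c_i = 1$ if the proposition turns out to be true and $c_i = 0$ otherwise, and let $n_p$ be the number of propositions assigned probability $p$.

```latex
\[
  h(p) \;=\; \frac{1}{n_p} \sum_{i \,:\, r_i = p} c_i ,
  \qquad
  \text{perfect calibration:}\quad h(p) = p \ \text{ for every response } p \text{ used.}
\]
```

In these terms, the example for curve A corresponds to $h(.7) = .88 > .7$ (underconfidence) and the example for curve C to $h(.7) = .47 < .7$ (overconfidence).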

While a number of investigators have studied calibration, the only
consistent finding has been that judges tend to be overconfident (for a
review of this literature, see Lichtenstein, Fischhoff, & Phillips, 1976).
The present studies constitute a systematic look at how well people are
calibrated and what affects their degree of calibration. In particular,
we want to know whether the amount of knowledge a judge possesses about
the content of the propositions being assessed affects his or her
calibration. Earlier studies (Adams & Adams, 1961; Clarke, 1960; Pitz,
1974; and Pollack & Decker, 1958) have reported some evidence that people
who know more are better calibrated. The work reported here provides
replication, clarification and extension of these findings.

Figure 1

Exemplar calibration curves

All  Experiments 

Certain  features  shared  by  all  experiments  are  reported  here  to  avoid 
repetition. 

Subjects. Except for Experiment 4, all subjects were paid volunteers
who responded to advertisements in the University of Oregon student
newspaper. Except for Experiment 4, the reported task was performed as part
of a two-hour group session along with several other judgmental tasks.
Group size varied from 25 to 48 persons.

Tasks. All test items were dichotomous items with the general form
"Absinthe  is  (a)  a precious  stone,  (b)  a liqueur."  In  all  experiments, 
subjects  made  two  responses  to  each  item.  First,  they  chose  one  of  the 
two  alternatives  as  their  best  guess  at  the  correct  alternative.  Second, 
they  indicated  with  a number  from  .5  to  1.0  the  probability  that  their 
choice  was  correct. 

Measures. For each experiment, we report: (1) the percentage of
questions for which the correct alternative was selected; (2) subjects'
mean probability response; and (3) a calibration curve.

Calibration curves were constructed by grouping (over subjects and
items) all the responses in the ranges .50-.59, .60-.69, .70-.79, .80-.89,
.90-.99, and 1.00. The mean response for each grouping is plotted against
the percent correct (hit rate) associated with those responses.
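
The grouping procedure just described can be sketched in a few lines of Python. This is a minimal illustration rather than the analysis program actually used: the function names are hypothetical, responses are assumed to be recorded to two decimal places, and the sample data at the end are invented.

```python
from collections import defaultdict


def bin_index(response):
    """Return 0..5 for the groups .50-.59, .60-.69, ..., .90-.99, and 1.00."""
    if not 0.5 <= response <= 1.0:
        raise ValueError("responses must lie between .5 and 1.0")
    # Work in integer hundredths to avoid floating-point edge effects.
    hundredths = round(response * 100)
    return min((hundredths - 50) // 10, 5)


def calibration_curve(responses, correct):
    """Return (mean response, hit rate, count) for each non-empty group."""
    groups = defaultdict(list)
    for r, c in zip(responses, correct):
        groups[bin_index(r)].append((r, c))
    curve = []
    for idx in sorted(groups):
        pairs = groups[idx]
        n = len(pairs)
        mean_response = sum(r for r, _ in pairs) / n
        hit_rate = sum(1 for _, c in pairs if c) / n
        curve.append((mean_response, hit_rate, n))
    return curve


# Invented responses pooled over subjects and items.
sample_responses = [0.5, 0.55, 0.7, 0.7, 0.9, 1.0, 1.0]
sample_correct = [True, False, True, False, True, True, True]
for mean_response, hit_rate, n in calibration_curve(sample_responses, sample_correct):
    print(f"mean response {mean_response:.3f}  hit rate {hit_rate:.3f}  n = {n}")
```

Plotting the hit rate against the mean response for each group gives the calibration curve; points below the diagonal indicate overconfidence.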
No Knowledge

Experiments 1a and 1b investigated the calibration of subjects with
severely limited knowledge.

Experiment 1a

Method.  Each  of  92  subjects  was  asked  to  decide,  for  each  of  12 
small  drawings,  whether  the  artist  was  a European  child  or  an  Asian  child, 
and  to  estimate  the  probability  that  their  selection  was  correct.  Each 
set  contained  six  drawings  made  by  children  from  European  countries  and 
six  drawings  from  Asian  countries,  all  taken  from  Kellogg  (1970),  who  had 
selected  them  to  illustrate  her  thesis  that  children's  drawings  are  the 
same  the  world  over.  This  suggested  that  discrimination  according  to 
national  origin  would  be  very  difficult.  The  test  session  was  preceded 
by a brief study period in which the subjects were informed of Kellogg's
thesis.

Results. As expected, the subjects had difficulty with this task.
Only 53.2% of their 1104 answers were correct. Their probabilistic responses,
however,  indicated  undue  confidence,  with  a mean  response  of  .677.  The 
calibration  curve  shown  in  Figure  2 strongly  suggests  that  these  subjects 
were  unaware  of  how  little  they  knew.  There  is  no  relationship  between 
their  probability  responses  and  the  associated  hit  rates. 

Experiment 1b

Method. Sixty-three subjects were taught how to read the stock
market charts for individual companies provided by the weekly Standard
and Poor report, Trendline. After the instruction period, they were given
charts of twelve stocks with data for the period from July 9, 1974 to
February 14, 1975. For each stock, they were asked to indicate whether
its March 22 closing price was higher or lower than that of February 14.
Each of four test sets included six stocks that had increased and six
that had decreased over the period, chosen at random from all stocks
that appeared in Trendline for February 14, 1975. Global market indices
(e.g., Dow-Jones) were similar for February 14, the last day shown on the
charts, and March 22, the target date, indicating that the market as a
whole neither increased nor decreased during this period.

Results. Again, the task was too difficult for subjects to perform
adequately. Only 47.2% of their choices were correct. Again, they
overestimated their knowledge, providing a mean probability of .654. The
calibration curve shown in Figure 2 shows the same insensitivity of
probability judgments to level of knowledge found in Experiment 1a.

Comment. The lack of calibration evinced by the subjects in these
two studies does not logically follow from their lack of knowledge.
Subjects would have been quite well calibrated had they always given a
probabilistic response of .5. This would have resulted in but one data point
on the calibration curve for each experiment, but that point would have
fallen reasonably close to the perfect calibration line. Only 7 of the
155 subjects in Experiments 1a and 1b acknowledged the limits of their own
knowledge by following this strategy.
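
The arithmetic behind this comment can be checked from the figures reported above. The Python sketch below is illustrative only (the variable and function names are hypothetical). It compares each experiment's mean response with its proportion correct, and shows where the single calibration point would have fallen had every subject responded .5 while making the same choices.

```python
# Proportion correct and mean probability response reported for each experiment.
experiments = {
    "1a": {"proportion_correct": 0.532, "mean_response": 0.677},
    "1b": {"proportion_correct": 0.472, "mean_response": 0.654},
}


def overconfidence(mean_response, proportion_correct):
    """Mean response minus proportion correct; positive means overconfidence."""
    return mean_response - proportion_correct


for name, stats in experiments.items():
    observed_gap = overconfidence(stats["mean_response"], stats["proportion_correct"])
    # With a constant response of .5 the only calibration point is
    # (.5, proportion correct), so its distance from the diagonal is:
    gap_if_always_half = overconfidence(0.5, stats["proportion_correct"])
    print(f"Experiment {name}: observed gap {observed_gap:+.3f}, "
          f"gap with constant .5 responses {gap_if_always_half:+.3f}")
```

The observed gaps are .145 and .182, whereas a constant response of .5 would have left gaps of only -.032 and +.028.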

A Little  Knowledge 

Will a small amount of knowledge improve calibration? Experiment 2
was designed to investigate this possibility by partially training subjects
to make the requisite discrimination.

Figure 2

Experiment 1: Calibration with no knowledge


Experiment  2 


Method. The stimuli were examples of the Latin phrase, "Mensa mea
bona est," handwritten by either European or American adults. Twenty
specimens were chosen on the basis of a pretest of 20 American subjects
who were asked to sort 100 such specimens into two piles, American and
European. The proportion of correct identifications for the 20 specimens
chosen for the experiment ranged from .40 to .60.¹ These 20 specimens
were randomly divided into two sets of 10, each of which included 5
European and 5 American specimens. One set was used as training stimuli;
the other was used as test stimuli. This random division was performed
four times, producing four paired sets of training and test stimuli.

Two  of  four  groups  of  subjects  (N  = 52)  received  training  on  this 
task.  In  the  training  phase,  they  were  asked  to  study  for  five  minutes 
the  ten  training  stimuli,  each  correctly  labeled.  Immediately  following 
this  rudimentary  training,  the  ten  test  stimuli  were  presented.  For  each, 
the  subjects  were  asked  to  indicate  whether  the  specimen  was  European 
or  American,  and  to  assess  the  probability  that  their  answer  was  correct. 
They  were  not  told  how  many  of  the  ten  test  stimuli  were  American. 

The  procedure  for  the  two  groups  of  untrained  subjects  (N  = 57)  was 
identical  except  that  the  specimens  they  studied  in  the  first  phase  were 
not  labeled  as  to  country  of  origin. 

Results. Training was moderately successful; the trained subjects
correctly identified 71.4% of the specimens, compared with 51.2% for
untrained subjects. The mean responses were .779 for the trained group,
.653 for the untrained group. As Figure 3 shows, trained subjects not
only knew more, but also were better calibrated; the untrained subjects,
as in Experiment 1, showed no evidence of calibration.

¹ We are grateful to Lewis Goldberg, out of whose files we stole, without
his knowledge, the handwriting specimens and the pretest results.

Different  Levels  of  Knowledge 

The  suggestion  that  greater  knowledge  improves  calibration  was  further 
explored  in  Experiment  3. 

Experiment  3 

Method. The stimuli were 150 general knowledge items with highly
varied content (e.g., Aden was occupied in 1839 by the [a] British, [b]
French; Bile pigments accumulate as a result of a condition known as [a]
gangrene, [b] jaundice). One hundred twenty subjects each responded to
75 items drawn from a pool of 150 items; 25 of the items received 80
responses, 100 items received 60 responses, and 25 items received 40
responses.

Results. Figure 4 presents the calibration curve over all 9,000
responses. It is substantially flatter than it should be. The hit rates
associated with the responses .50 and .60, and with .70 and .80, were
virtually identical. Subjects generally overestimated the extent of their
knowledge, getting 63.8% of the answers correct, but assigning a mean
probability of .724.

The subjects were divided into three subgroups according to how
knowledgeable they had been: the best subjects (40 subjects with 51 or
more correct answers out of 75), the middle subjects (39 subjects with
46 to 50 correct answers), and the worst subjects (41 subjects with
fewer than 46 correct answers). Separate analyses were performed for
each group. Calibration curves appear in Figure 5, with the corresponding
statistics in Table 1. These data strongly suggest that the more one
knows, the better one's calibration. All groups tended to overconfidence
but the most knowledgeable subjects showed the least overconfidence.

Figure 3

Experiment 2: "Mensa mea bona est." Effects of training on calibration

Figure 4

Experiment 3: Overall calibration curve.

TABLE 1

Summary Table of Calibration Statistics
for Experiments 3 and 4

                                   Number of    Percent    Mean
                                   Responses    Correct    Response

Exp. 3
  All subjects                       9000        .638       .724
  By subject:
    Best 40 subjects                 3000        .714       .743
    Middle 39 subjects               2925        .643       .711
    Worst 41 subjects                3075        .560       .706
  By subject x item:
    Best subjects-easy items         1532        .847       .796
    Middle subjects-easy items       1472        .800       .747
    Worst subjects-easy items        1516        .695       .733
    Best subjects-hard items         1468        .576       .716
    Middle subjects-hard items       1453        .483       .674
    Worst subjects-hard items        1559        .429       .679

Exp. 4
  All subjects                       5000        .779       .784
  By subject x item:
    Best subjects-easy items         1450        .923       .862
    Worst subjects-easy items        1450        .847       .820
    Best subjects-hard items         1050        .655       .705
    Worst subjects-hard items        1050        .512       .681


Dividing  responses  according  to  item  difficulty  rather  than  subject 
proficiency  produced  much  the  same  result  (not  shown).  The  calibration 
curve  for  the  easiest  items  was  considerably  closer  to  the  identity 
diagonal  than  that  for  the  most  difficult  questions. 

Pushing this idea one step further, one might ask how well calibrated
were the best subjects on the easiest items? Here, we might expect to
find the best calibration. The data of Experiment 3 were re-analyzed to
investigate this possibility. Items were sorted into two groups according
to the percentage of subjects answering them correctly: easy items (67%
or more correct) and hard items (less than 67% correct). Each of the
three groups of subjects was calibrated for each of the two groups of
items, to produce the six calibration curves shown in Figure 6. Summary
statistics are shown in Table 1.

Despite  some  irregularities  in  these  calibration  curves  due  to  the 
reduced  number  of  responses  per  data  point,  a pattern  of  roughly  parallel 
lines  emerged.  With  low  knowledge,  substantial  overconfidence  occurred. 
However,  when  the  percentage  of  correct  answers  was  high  (85%  for  the  best 
subjects  on  the  easy  items  and  80%  for  the  middle  subjects  on  the  easy  items), 
substantial  underconfidence  was  seen  (e.g.,  75%  of  the  .60  responses  were 
correct).  Calibration  appears  to  change  with  increased  knowledge,  but 
not  necessarily  for  the  better. 
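
The change of sign described above can be read directly off the Experiment 3 rows of Table 1. The Python sketch below simply subtracts proportion correct from mean response in each of the six cells. The names are hypothetical, and this single summary difference is a cruder measure than the calibration curves themselves.

```python
# (proportion correct, mean response) for Experiment 3, taken from Table 1.
table_1_exp3 = {
    ("best", "easy"): (0.847, 0.796),
    ("middle", "easy"): (0.800, 0.747),
    ("worst", "easy"): (0.695, 0.733),
    ("best", "hard"): (0.576, 0.716),
    ("middle", "hard"): (0.483, 0.674),
    ("worst", "hard"): (0.429, 0.679),
}


def confidence_gap(proportion_correct, mean_response):
    """Positive values indicate overconfidence, negative values underconfidence."""
    return round(mean_response - proportion_correct, 3)


gaps = {cell: confidence_gap(*stats) for cell, stats in table_1_exp3.items()}
for (subjects, items), gap in gaps.items():
    label = "overconfident" if gap > 0 else "underconfident"
    print(f"{subjects:>6} subjects, {items} items: {gap:+.3f} ({label})")
```

The best and middle subjects come out slightly underconfident on the easy items (-.051 and -.053), while every group is overconfident on the hard items (+.140 to +.250).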

Figure 6

Experiment 3 yet again. Calibration split six ways, by subjects and by items

Experiment 4

Although conducted for a somewhat different purpose (see below),
Experiment 4 affords a replication of the above analysis.

Method. All on-campus graduate students in the Psychology Department
of the University of Oregon were asked to participate in this experiment.
Packets with stimuli and instructions were sent to all 64 graduate students;
50 were returned completed.

The stimuli were 50 general knowledge items (30 of those used in
Experiment 3 and 20 additional, similar items) and 50 specially-written
items dealing with psychology (e.g., the Ishihara test is [a] a perceptual
test, [b] a social anxiety test; Anna Freud is Sigmund Freud's [a] oldest
child, [b] youngest child). The two types of items were randomly
intermixed in the stimulus package.

Results. Separate calibration curves are shown in Figure 7 for four
subsets  of  responses  obtained  by  splitting  the  subjects  into  best  and  worst 
at  the  median  (74.5%)  of  the  distribution  of  percentage  correct,  and 
splitting  the  items  into  easy  (at  least  75%  correct;  58  items)  and  hard 
(fewer  than  75%  correct;  42  items).  Summary  statistics  are  given  in  Table  1. 
For  these  analyses,  no  distinction  was  made  between  general  knowledge  and 
psychology  items.  The  same  pattern  of  almost  parallel  lines  found  in 
Figure  6 emerged  from  these  data. 

Effects  of  Chance  Fluctuations 

The analytic technique used in Experiments 3 and 4, in which the data
were divided into subsets as a function of item difficulty and subjects'
performance, is vulnerable to random fluctuations which could artifactually
produce separation between the calibration curves for the subsets. Assume
that our subjects were equally knowledgeable and identically calibrated.
In any sample of their responses, some will probably appear more
knowledgeable by chance. The same chance factors that led them to have a higher
overall percent correct will also lead them to have a higher hit rate
for their responses of .5, .6, etc., and thus have an elevated calibration
curve. The extent to which such chance factors could lead to differences
in calibration was examined by simulating the results of Experiment 4.

For  the  simulation,  all  subjects  were  assumed  to  have  exactly  the  same 
calibration,  which  was  taken  as  the  actual  calibration  derived  from 
pooling all 5,000 of their responses (100 items for each of 50
subjects).  Subjects'  original  probability  responses  were  retained  in  the 
simulation.  For  each  response,  the  correctness  of  the  chosen  alternative 
was  simulated  in  accordance  with  the  overall  calibration  curve.  For 
example,  since  in  the  real  data  86%  of  the  .90  responses  were  correct, 
in  the  simulated  data  each  response  of  .90  received  a simulated  outcome, 
either  correct — with  a probability  of  .86,  or  incorrect — with  a probability 
of  .14.  These  simulated  data  (the  original  probability  responses  with 
randomly  chosen  outcomes)  were  then  partitioned  into  four  subsets — best 
and worst subjects; easy and hard items. The calibration for each
subset was computed. The entire simulation was repeated 50 times. Figure 8
shows  the  average  calibration  curves  across  the  50  replications.  Figure  8 
is  directly  comparable  to  Figure  7;  it  is  based  on  the  same  data  except 
for  the  assumption  that  all  subjects  have  exactly  the  same  calibration. 

The amount of separation between the calibration curves in Figure 8
is due solely to chance fluctuations. This separation is much smaller
than the separation found in the original data (Figure 7). We reject the
hypothesis that in Figure 7 all subjects on all items had the same
calibration.

Figure 8

Results of a simulation to parallel the findings of the previous figure
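
The logic of the simulation can be sketched in Python as follows. This is a reconstruction under stated assumptions, not the program used for Figure 8. All function names are hypothetical. The responses generated at the end are random placeholders for the real data. Of the pooled hit rates, only the value for .90 responses (.86) is reported in the text; the others are invented. The sketch also assumes that the best/worst and easy/hard splits are recomputed from the simulated outcomes on each replication.

```python
import random
from statistics import median

# Pooled hit rate for each response level. Only .90 -> .86 comes from the
# text; every other value is a made-up placeholder.
POOLED_HIT_RATE = {0.5: 0.55, 0.6: 0.62, 0.7: 0.70, 0.8: 0.78, 0.9: 0.86, 1.0: 0.95}


def simulate_outcomes(responses, hit_rate, rng):
    """responses[s][i] is subject s's probability for item i; returns True/False outcomes."""
    return [[rng.random() < hit_rate[r] for r in row] for row in responses]


def partition(outcomes, item_threshold=0.75):
    """Split subjects at the median proportion correct and items at item_threshold."""
    n_subj, n_items = len(outcomes), len(outcomes[0])
    subj_score = [sum(row) / n_items for row in outcomes]
    item_score = [sum(outcomes[s][i] for s in range(n_subj)) / n_subj
                  for i in range(n_items)]
    cut = median(subj_score)
    best = {s for s in range(n_subj) if subj_score[s] > cut}
    easy = {i for i in range(n_items) if item_score[i] >= item_threshold}
    return best, easy


def subset_hit_rates(responses, outcomes, best, easy):
    """Hit rate for each (subject group, item group, response level) cell."""
    tallies = {}
    for s, row in enumerate(responses):
        for i, r in enumerate(row):
            key = ("best" if s in best else "worst",
                   "easy" if i in easy else "hard", r)
            hits, n = tallies.get(key, (0, 0))
            tallies[key] = (hits + outcomes[s][i], n + 1)
    return {key: hits / n for key, (hits, n) in tallies.items()}


def run_simulation(responses, hit_rate, replications=50, seed=0):
    """Average the subset hit rates over repeated simulated outcomes."""
    rng = random.Random(seed)
    sums, counts = {}, {}
    for _ in range(replications):
        outcomes = simulate_outcomes(responses, hit_rate, rng)
        best, easy = partition(outcomes)
        for key, rate in subset_hit_rates(responses, outcomes, best, easy).items():
            sums[key] = sums.get(key, 0.0) + rate
            counts[key] = counts.get(key, 0) + 1
    return {key: sums[key] / counts[key] for key in sums}


if __name__ == "__main__":
    rng = random.Random(0)
    levels = sorted(POOLED_HIT_RATE)
    # Placeholder responses: 50 subjects x 100 items.
    responses = [[rng.choice(levels) for _ in range(100)] for _ in range(50)]
    averages = run_simulation(responses, POOLED_HIT_RATE)
    for key in sorted(averages):
        print(key, round(averages[key], 3))
```

Because every simulated subject shares the same underlying calibration, any separation between the four averaged curves in such a run reflects only the selection effect of splitting on chance outcomes, which is the quantity the original data are compared against.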


Tests Varying in Difficulty versus Sub-tests Varying in Difficulty

The  previous  experiments  analyzed  subsets  of  items  actually  contained 
in  a single  test.  It  may  be  that  some  adaptation  to  the  overall  difficulty 
of  the  test  might  account  for  the  observed  overestimation  with  hard  items 
and  underestimation  with  easy  items.  This  possibility  was  explored  in 
Experiment  5. 

Experiment  5 

Method.  From  the  items  used  in  Experiment  3,  two  tests  of  50  items 
each  were  compiled.  Items  were  selected  in  pairs  according  to  the  percent 
of  subjects  answering  them  correctly  in  Experiment  3.  Each  item  in  the 
hard  test  was  matched  with  an  item  in  the  easy  test  that  had  been  answered 
correctly  by  an  additional  20%  of  subjects.  The  mean  percent  correct  for 
the  hard  test  was  60.4  (range,  46.2  to  77.5);  for  the  easy  test,  80.5 
(range,  66.2  to  97.5). 

The  two  tests  were  distributed  to  93  subjects;  48  received  the  hard 
test  and  45  the  easy  test. 

Results. Figure 9 compares results from this experiment (the "complete"
tests) with those from Experiment 3 using the same items (the "subset" tests).
Here, too, the calibration curve depends on test difficulty, with
underconfidence on the easy test and overconfidence on the hard test. The
similarity  between  the  calibration  curves  for  the  complete  tests  and  the 
subset  tests  suggests  that  artifactual  explanations  of  the  results  of 
Experiment  3 are  untenable. 

Eleven  items  of  intermediate  difficulty  were  used  in  both  the  hard 
and  the  easy  tests  of  Experiment  5 (these  were  the  hardest  of  the  easy  test 
and the easiest of the hard test). Analyses for these items revealed no
differences in percent correct, mean response, or calibration between the
two  groups.  Thus,  there  appeared  to  be  no  context  effects  in  responses 
to  these  items. 


Expertise 

Perhaps  the  categorization  of  items  into  "hard"  and  "easy"  does  not 
really  capture  the  essence  of  expertise.  Experts  might  be  better  calibrated 
not  only  because  they  know  the  correct  answer  for  more  of  the  items,  but 
also  because  they  have  thought  more  about  the  whole  topic  area,  and  thus 
can more readily recognize the extent and the limitations of their
knowledge. The following analysis searched for differences in calibration
due to any sort of "quality of insight" that experts might have above and
beyond their level of knowledge.

Method. The experts were the 50 graduate students in the Department
of Psychology mentioned in the description of Experiment 4. This
experiment is simply a re-analysis of those data, comparing their
calibration on the 50 items pertaining to psychological knowledge with
their calibration on the 50 general-knowledge items.

Results. The psychology subtest and general-knowledge subtest were
virtually identical in percent correct (75.7 vs. 76.0) and mean probability
response (.780 vs. .778). Figure 10 shows that calibration for the two
subtests was essentially the same [2]. Thus, with equal knowledge there is
no evidence that expertise in a particular subject area leads to better
calibration.

[2] No test of significance of the difference between two calibration curves
is known. However, even the most extreme difference in Figure 10
(associated with the responses .80 to .89) has a probability of .09 of
being due to chance, assuming a uniform prior.
Intelligence


The subjects in Experiment 3 were mostly undergraduate students
attending the University of Oregon. They are probably less intelligent, on
the  average,  than  the  graduate  student  subjects  of  Experiment  4,  who  are 
highly  selected  for  intelligence  by  the  admissions  procedures  of  the 
Psychology  Department.  We  are  thus  able  to  investigate  the  effects  of 
intelligence  on  calibration. 

Method. Subtests of 73 items each, matched item by item in
difficulty, were created from the Experiment 3 (regular volunteer subjects)
and  Experiment  4 (graduate  student  subjects)  data. 

Results.  Thirty  items  were  common  to  both  groups.  Responses  to 
them  revealed  the  graduate  students'  superior  knowledge.  They  averaged 
76.2  percent  correct  on  these  items,  compared  with  the  regular  volunteers' 
mean  of  63.9  percent  correct.  The  graduate  students  had  fewer  correct 
answers  on  only  4 of  the  30  items. 

The  matching  process  succeeded  in  producing  subtests  with  a mean 
percent correct of 69.8 for the graduate students and 69.2 for the
regular volunteers. Mean probability responses were .747 and .751, respectively.

Figure 11 shows the calibration of the two groups. It appears that
the graduate students may be slightly better calibrated at the extremes.
The differences, however, seem slight when compared with differences in
calibration due to test difficulty.

Distribution  of  Responses 

Figure  12  presents  the  proportion  of  subjects'  probability  responses 
that  fell  into  each  response  category.  These  proportions  are  shown  for 
all groups or subgroups of all experiments, ordered by percent correct.
Subjects showed a definite tendency to make more use of the high end of
the response scale for the easiest tests. However, this tendency, while
in the right direction, was weaker than it should have been. While the
percent correct ranged from 43 to 92, the mean probability response ranged
only from .65 to .86. It is this insufficient discrimination which leads to
underestimation with easy tests and overestimation with hard tests.
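
The arithmetic behind this point can be made explicit. The sketch below uses
only the four figures quoted above and assumes, for illustration, that the
lowest mean probability (.65) belongs to the group with the lowest percent
correct (43) and the highest (.86) to the group with the highest (92); the
text does not state this pairing directly. The function name is hypothetical.

```python
def confidence_bias(mean_probability, proportion_correct):
    """Positive values mean overconfidence, negative values underconfidence."""
    return mean_probability - proportion_correct

# Assumed pairing of the extremes reported in the text.
hardest_bias = confidence_bias(0.65, 0.43)   # hardest test
easiest_bias = confidence_bias(0.86, 0.92)   # easiest test

# How much of the spread in actual difficulty is reflected in responses.
response_range = 0.86 - 0.65
accuracy_range = 0.92 - 0.43
compression = response_range / accuracy_range

print(round(hardest_bias, 2))   # 0.22
print(round(easiest_bias, 2))   # -0.06
print(round(compression, 2))    # 0.43
```

Under that assumption, mean responses span less than half the range that
percent correct does, which yields overconfidence of about .22 at the hard
end and underconfidence of about .06 at the easy end.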

The  other  striking  attribute  of  Figure  12  is  the  great  frequency 
of extreme responses (.5 and 1.0). While no response category was
unused, over all experiments, subjects used the extreme categories for
about half their responses. This inclination to treat the task as
dichotomous (either "I know the answer," 1.0, or "I don't know the answer,"
.50) appears to have been less pronounced in Experiments 1 and 2, with
relatively  few  items  all  dealing  with  the  same  topic,  than  in  Experiments 
3,  4,  and  5,  which  used  many  items  concerning  diverse  topics. 

The effect on calibration of the tendency to avoid using probabilities
other than .5 and 1.0 was examined with the data of Experiment 3. Subjects
were divided into three groups: heavy users of .5 and 1.0 (49 subjects
using these two responses more than 50% of the time; mean use, 67.7%);
medium users of .5 and 1.0 (33 subjects using .5 and 1.0 between 41% and
49% of the time; mean use, 46.9%); and light users (38 subjects using .5
and 1.0 40% or less of the time; mean use, 34.4%). The three groups were
similar in percent of items answered correctly (65%, 64%, 62%, respectively).
Their calibration curves (not shown) were highly similar, and all three
groups showed the same gross overconfidence with hard items and mild
underconfidence with easy items. We thus found no support for the notion that
the tendency to avoid extreme probability responses, as an individual
difference, affects calibration.
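
The three-way grouping of subjects can be sketched as below. The function
names are hypothetical, and the treatment of shares falling between the
stated bands (for example, exactly 50%) is an assumption of this sketch; the
text gives only the bands "more than 50%", "between 41% and 49%", and "40%
or less".

```python
def extreme_use(responses):
    """Fraction of a subject's responses that are exactly .5 or 1.0."""
    return sum(r in (0.5, 1.0) for r in responses) / len(responses)

def usage_group(responses):
    """Classify a subject by how often the extreme categories are used."""
    share = extreme_use(responses)
    if share > 0.50:
        return "heavy"
    if share <= 0.40:
        return "light"
    return "medium"  # assumption: everything in between counts as medium

# Toy subjects (hypothetical responses).
print(usage_group([0.5, 1.0, 1.0, 0.7]))        # heavy
print(usage_group([0.5, 0.6, 0.7, 0.8, 0.9]))   # light
```

Calibration curves would then be computed separately within each group and
compared.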

Discussion 

At  the  outset,  we  must  caution  the  reader  about  some  limitations  on 
the generalizability of these findings:

1) All subjects were naive about probabilities, and received only
minimal training via experimental instructions. Even modest additions
to the instructions might lead to pronounced changes in calibration [3].

2)  The  items  always  had  two  alternatives,  and  the  subjects  were 
restricted  to  probabilistic  responses  greater  than  or  equal  to  .5.  Use 
of  true-false  or  multi-alternative  items,  or  elicitation  of  the  full 
range  of  probabilities,  could  affect  calibration. 

3)  Because  of  the  large  amounts  of  data  needed  for  stable  estimation 
of  calibration  curves,  only  group  results  are  reported  here.  It  seems 
reasonable  that  important  individual  differences  exist  in  calibration, 
but this possibility has so far received only the most rudimentary
exploration (Adams & Adams, 1961).

Nonetheless,  strong  effects  emerged.  People  do  show  some  realism 
and  sensitivity  in  their  probability  assessments,  although  in  general  they 
are not well calibrated. With difficult items, assessors are
overconfident; with easy items, they are underconfident.

[3] In a recent study (Slovic, Fischhoff, & Lichtenstein, 1976), we found
little  difference  in  the  calibration  of  odds  responses  produced  with 
minimal  and  with  extensive  instructions. 


The strikingly different calibration curves for items of varying
difficulty are a direct result of subjects' insensitivity to how much they
really know. Among the items for which they believe that they have a 50%
chance of knowing the correct answer, the appropriate probability may be
anywhere between .45 and .85. When they estimate 1.00, the appropriate
probability may be between .55 and .95 (Figure 6). The ease with which
the different calibration curves were constructed from the fairly
representative sets of items used in Experiments 3, 4, and 5, and the large
numbers  of  responses  in  each  category  for  even  the  most  extreme  curves, 
indicate  that  subjects'  inability  to  make  discriminations  is  widespread 
(i.e.,  there  are  not  only  some  instances  in  which,  for  example,  people 
should  be  saying  .75  when  they  actually  say  .50,  but  many  such  instances). 

Although  subjective  probabilities  have  a prominent  role  in  many 
psychological  theories,  the  study  of  probabilities  themselves  has  been 
atheoretical  in  most  cases  (including  the  present  study;  see  also 
Lichtenstein,  Fischhoff,  and  Phillips,  1976).  While  there  have  been  some 
suggestions for, or fragments of, process theories of calibration (Pitz,
1974; Slovic, 1972; Tversky & Kahneman, 1974), only Pitz (1974) predicts
a decrease in overconfidence as knowledge increases.

Practical  Implications.  Aside  from  their  theoretical  import  for  the 
psychologist  interested  in  how  people  perform  judgments  under  conditions 
of  uncertainty,  these  results  have  strong  implications  for  those  whose 
jobs  involve  actually  making  and  taking  responsibility  for  such  judgments. 

With the development of sophisticated information processing and decision
analytic techniques, operations as diverse as intelligence analysis, corporate
planning, environmental impact assessment and nuclear power engineering
utilize explicit probability assessments (Fischhoff, 1976). Users
of these approaches should consider results like the present ones in
determining how much faith to put in the results of their analyses. Similarly,
psychologists who elicit subjective probability estimates in the study of
behavioral phenomena might think twice before taking them at face value
or expecting too much of them.


In addition to their cautionary value, these results may also help
improve the quality of probabilistic analyses. Assume that in the context
of a practical problem using judgments of the type studied here, a judge
reports a probability of .90. From Figure 4, we know that a better estimate
of the appropriate probability is .71, and we would do better to treat it
as such. Although such "correction after the fact" is better than taking
biased judgments at face value, the revised assessments may still be
inappropriate. In the present example, even though our best guess of
the appropriate probability is .71, anything between .40 and .90 might
be even better, depending on the difficulty of the item involved.
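
A "correction after the fact" of this kind amounts to replacing each reported
probability with the hit rate previously observed for that response. The
sketch below is illustrative only: the function names are hypothetical, and
the history is toy data constructed so that responses of .90 were correct 71%
of the time, matching the single value quoted from Figure 4.

```python
def build_correction_table(history):
    """history: iterable of (reported_probability, was_correct) pairs.
    Returns the observed proportion correct for each reported probability."""
    totals = {}
    for reported, was_correct in history:
        hits, n = totals.get(reported, (0, 0))
        totals[reported] = (hits + int(was_correct), n + 1)
    return {p: hits / n for p, (hits, n) in totals.items()}

def corrected(reported, table):
    """Observed hit rate for this response, or the response itself
    when no past data exist for it."""
    return table.get(reported, reported)

# Hypothetical history: 100 past judgments at .90, of which 71 were correct.
history = [(0.90, True)] * 71 + [(0.90, False)] * 29
table = build_correction_table(history)

print(corrected(0.90, table))  # 0.71
print(corrected(0.60, table))  # 0.6 (no data, left unchanged)
```

Such a table averages over items of every difficulty, which is why a single
corrected value can still be far from the appropriate probability for a
particular item.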


If we know how difficult the item is, then we can make a much more
accurate correction. In practice, however, such situations will be rare.
To know how difficult an item is, we must know the correct answer. But
if we know the correct answer, we will not have any practical need for
the judge's assessment. Such assessments are valuable only when the
correct answer is not known. Short of knowing the correct answer, the only
way to capitalize on the relationship between item difficulty and type of
miscalibration seems to be to assume something about the difficulty of
the items in the world in which our judge is functioning. The distribution
of judges' responses (as shown in Figure 12) could be exploited for this
purpose. Across the 16 groups or subgroups, there is a correlation of .91